Abstract. A p-adic modification of the split-LBG classification method is presented in which first clusterings and then cluster centers are computed which locally minimise an energy function. The outcome for a fixed dataset is independent of the prime number p with finitely many exceptions. The methods are applied to the construction of p-adic classifiers in the context of learning.

1. Introduction

The field $\mathbb{Q}_p$ of p-adic numbers is of interest in hierarchical classification because of its inherent hierarchical structure [TU]. A great amount of work deals with finding p-adic data representations (e.g. [H[9]).

In gj, the use of more general p-adic numbers for encoding hierarchical data was advocated in order to be able to include the case of non-binary dendrograms in the scheme without having to resort to a larger prime number p. This was applied in [5] to the special case of data consisting of words over a given alphabet, where proximity of words is defined by the length of the common initial part. There, an agglomerative hierarchic p-adic clustering algorithm was described. However, the question of finding optimal clusterings of p-adic data was not raised.

Already in pQ, the performance of classical and p-adic classification algorithms was compared in the segmentation of moving images. It was observed that the p-adic ones were often more efficient. Learning algorithms using p-adic neural networks are described in [2], [6].

Inspired by OQ, our main concern in this article will be a p-adic adaptation of the so-called split-LBG method which finds energy-optimal clusterings of data. The name "LBG" refers to the initials of the authors of [7], where it was first described. Their method is to find cluster centers, and then to group the data around the centers. In the next step, the cluster centers are split, and more clusters are obtained. This process is repeated until the desired class number is attained. For p-adic data, this approach does not make sense: first of all, cluster centers are in general not unique; and secondly, because the dendrogram is already determined by the data, an arbitrary choice of cluster centers is not possible, as it can lead to incomplete clusterings. Hence, we first find clusterings by refining in the direction of highest energy reduction, until the class number exceeds a prescribed bound. Thereafter, candidates for cluster centers are computed: they minimise the cluster energy. The result is a sub-optimal method for p-adic classification which splits a given cluster into its maximal proper subclusters. A variant first discards all quasi-singletons, i.e. clusters of energy below a threshold value. The a posteriori choice of centers turns out to be useful for constructing classifiers.

A first application of some of the methods described here to event history data of building stocks is described in [3]. There, the classification algorithm is performed on different p-adic encodings of the data in order to compare the dynamics of some sampled municipal building stocks.

After introducing notation in Section 2, we briefly describe the classical split-LBG method in Section 3. Section 4 reformulates the minimisation task of split-LBG in the p-adic setting, and describes the corresponding algorithms. The issue of the choice of the prime p is dealt with in Section 5. Section 6 constructs classifiers and presents an adaptive learning method in which accumulated clusters of large energy are split.

2. Generalities 

2.1. p-adic numbers. Let p be a prime number, and K a field which is a finite extension field of the field $\mathbb{Q}_p$ of rational p-adic numbers. We call the elements of K simply p-adic numbers. K is a normed field whose norm $|\cdot|_K$ extends the p-adic norm $|\cdot|_p$ on $\mathbb{Q}_p$. Let $O_K := \{x \in K \mid |x|_K \le 1\}$ denote the local ring of integers of K. Its maximal ideal $\mathfrak{m}_K = \{x \in K \mid |x|_K < 1\}$ is generated by a uniformiser $\pi$. It has the property $v(\pi) = \frac{1}{e}$, where $e \in \mathbb{N}$ is the ramification degree of $K/\mathbb{Q}_p$.

All elements $x \in K$ have a $\pi$-adic expansion

(1) $x = \sum_{i \ge -m} \alpha_i \pi^i$

with coefficients $\alpha_i$ in some set $\mathcal{R} \subset K$ of representatives for the residue field $O_K/\mathfrak{m}_K \cong \mathbb{F}_{p^f}$. In the case $K = \mathbb{Q}_p$, the choice $\mathcal{R} = \{0, 1, \dots, p-1\}$ is quite often made.

By X we will always mean a finite set of data taken from K.
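For the special case $K = \mathbb{Q}_p$ with uniformiser $\pi = p$, the norm and the first coefficients of expansion (1) can be computed directly. The following is a minimal sketch under that assumption; the function names `valuation`, `padic_norm` and `digits` are illustrative and do not come from the paper.

```python
from fractions import Fraction

def valuation(x, p):
    """p-adic valuation v(x) of a non-zero rational x (assumes K = Q_p)."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("v(0) is infinite")
    num, den, v = x.numerator, x.denominator, 0
    while num % p == 0:      # powers of p in the numerator raise v
        num //= p
        v += 1
    while den % p == 0:      # powers of p in the denominator lower v
        den //= p
        v -= 1
    return v

def padic_norm(x, p):
    """|x|_p = p^(-v(x)), with |0|_p = 0."""
    x = Fraction(x)
    return Fraction(0) if x == 0 else Fraction(p) ** (-valuation(x, p))

def digits(n, p, length):
    """First `length` coefficients of expansion (1) for a non-negative integer n."""
    out = []
    for _ in range(length):
        out.append(n % p)
        n //= p
    return out
```

For instance, `digits(11, 3, 4)` gives `[2, 0, 1, 0]`, since 11 = 2 + 0·3 + 1·9, and `padic_norm(12, 2)` gives 1/4.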

2.2. p-adic clusters. A disk in some finite set $X \subset K$ is a subset of the form

$\{x \in X \mid |x - a|_K \le \varepsilon\}$

for some $a \in X$ and $\varepsilon > 0$. In particular, any singleton $\{x\} \subset X$ is a disk in X.

The cluster property of a subset C of p-adic data $X \subset K$ is given by saying that for any $a \in C$ and any $x \in X$ it holds true that

(2) $|x - a|_K < \mu(C) \Rightarrow x \in C$,

where

$\mu(C) := \max\{|x - y|_K \mid x, y \in C\}$

is the cluster diameter. As a consequence, a cluster is a union of disks in X. We will also call a disk in X a verticial cluster, because in the dendrogram for X, the vertices correspond to those clusters which are (non-singleton^1) disks. More on the dendrogram associated to p-adic data will be said in Section 4.1. In Figure 1 the ultrametric property of dendrograms is visualised as follows: data b, c connected by a path consisting of vertical and horizontal line segments are considered as near, if the sum of the vertical parts is short. A third datum a further away from b and c is, by ultrametricity, at equal distance to b and c. This fact is visualised by having paths $a \rightsquigarrow b$ and $a \rightsquigarrow c$ with vertical components summing up to equal length.

^1 In many definitions of dendrograms, the data correspond to terminal vertices, but in our definition in Section 4.1 data are not considered as vertices of the dendrogram. Nevertheless, we do not exclude singleton clusters from the definition of "verticial". We apologise for this inconsistency.

Figure 1. Dendrogram in which b, c are closer to each other than to a. It contains a subset which is not a cluster.

Figure 2. Dendrogram with equidistant data contains non-verticial clusters.

Example 2.1. Let X = {a, b, c}, and consider the subset C = {a, b}. In Figures 1 and 2 we assume two different dendrograms for our data X. In Figure 1 the disks are the singletons, the set {b, c}, and the whole dataset X. Hence, C is not a cluster in the case of Figure 1, because it does not satisfy the cluster property (2): b and c are at distance less than the diameter which equals the distance between a and b, whereas C contains b but not c. However, in Figure 2 all data are at equal distance, so the only disks are the singletons and X. Hence, C is a cluster in Figure 2, but not a disk, i.e. not verticial.
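The two situations of Example 2.1 can be checked numerically. The sketch below assumes integer data inside $\mathbb{Q}_p$ and picks hypothetical values for a, b, c: with p = 2 and (a, b, c) = (1, 0, 2) the data b, c are closer to each other than to a, while with p = 3 and (a, b, c) = (0, 1, 2) all data are equidistant. The helper names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

def norm(x, p):
    """p-adic norm of an integer x (data assumed to lie in Z inside Q_p)."""
    if x == 0:
        return Fraction(0)
    v = 0
    while x % p == 0:
        x //= p
        v += 1
    return Fraction(1, p ** v)

def diameter(C, p):
    """Cluster diameter mu(C): largest pairwise distance, 0 for a singleton."""
    return max((norm(x - y, p) for x, y in combinations(C, 2)), default=Fraction(0))

def is_cluster(C, X, p):
    """Cluster property (2): every x in X closer than mu(C) to some a in C lies in C."""
    mu = diameter(C, p)
    return all(x in C for a in C for x in X if norm(x - a, p) < mu)

def disks(X, p):
    """All disks in X; radii range over the distances occurring in X (0 gives singletons)."""
    radii = {norm(x - y, p) for x in X for y in X}
    return {frozenset(x for x in X if norm(x - a, p) <= r) for a in X for r in radii}

# Situation of Figure 1: {a, b} = {1, 0} is not a cluster for p = 2.
print(is_cluster({1, 0}, [1, 0, 2], 2))
# Situation of Figure 2: {a, b} = {0, 1} is a cluster for p = 3, but not a disk.
print(is_cluster({0, 1}, [0, 1, 2], 3), frozenset({0, 1}) in disks([0, 1, 2], 3))
```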

A clustering of X is a collection $\mathscr{C}$ of disjoint clusters of X whose union is the whole dataset X. It is called verticial, if it consists entirely of verticial clusters.

Notice that the definition of cluster depends on the dataset X. In particular, a non-verticial cluster can be made into a disk by deleting some data from X. E.g. in Figure 2 the removal of c from the dataset turns C = {a, b} into a verticial cluster. In general, if $\mathscr{C}$ is a clustering of X, and $Y \subseteq X$, then $\mathscr{C}|_Y := \{C \cap Y \mid C \in \mathscr{C}\}$ is the restriction of $\mathscr{C}$ to Y. This motivates us to consider only the case of verticial clusterings.

Assumption. All clusterings we consider are verticial on some specified (non-empty) subsets of X.

3. The split-LBG algorithm 
Here, we briefly review the classical split-LBG algorithm, which was first described in [7].

Let $X = \{a_1, \dots, a_n\}$ and $C = \{c_1, \dots, c_k\}$ be sets of vectors in $\mathbb{R}^m$, where X is considered as the data and C are the prespecified cluster centers. The task is classically to find a partition $\Omega = \{\Omega_c \mid c \in C\}$ of X into k clusters $\Omega_c$ minimising the energy

$E(\Omega, C) = \sum_{c \in C} \sum_{x \in \Omega_c} d(x, c)$,

where d(x, y) is Euclidean distance in $\mathbb{R}^m$. In fact, the split-LBG method works with varying C by alternately constructing partitions and then replacing each $c \in C$ by two new centers $c + \epsilon$, $c - \epsilon$, where $\epsilon$ is a perturbation vector in $\mathbb{R}^m$ of small norm. From these, a new partition is constructed, etc.
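The alternation between partitioning and splitting can be sketched as follows. This is only an illustration under assumptions not stated above: centers are updated to the mean of their cluster (the usual Lloyd-type step), the perturbation is `eps` times the all-ones vector, and the number of centers doubles in each round. The names `lloyd` and `split_lbg` are hypothetical.

```python
import math

def nearest(x, centers):
    """Index of the center closest to x in Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))

def lloyd(data, centers, iters=20):
    """Alternate between grouping data around centers and recomputing centers as means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in data:
            groups[nearest(x, centers)].append(x)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)   # an empty group keeps its old center
        ]
    return centers

def split_lbg(data, k, eps=1e-3):
    """Start from the global mean; split every center c into c + eps and c - eps."""
    centers = [tuple(sum(col) / len(data) for col in zip(*data))]
    while len(centers) < k:
        centers = [tuple(ci + s * eps for ci in c) for c in centers for s in (+1, -1)]
        centers = lloyd(data, centers)
    return centers

data = [(0.0,), (0.2,), (10.0,), (10.2,)]
print(sorted(split_lbg(data, 2)))
```

Because the centers double in each round, this sketch returns exactly k centers only when k is a power of two.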

4. Split-LBG in the p-adic case 

In [Tj it was observed that the split-LBG method has no direct translation using 
the p-adic metric. Here, we describe a p-adic modification of the task from the 
previous section. 

Let $X = \{x_1, \dots, x_n\} \subset K$ be some data consisting of n p-adic numbers, and fix a number k. The task is to find a clustering $\mathscr{C} = \{C_1, \dots, C_\ell\}$ of X with $\ell \le k$, and for each cluster $C \in \mathscr{C}$ a center $a_C \in C$, minimising the expression

$E_p(X, \mathscr{C}, a) := \sum_{C \in \mathscr{C}} \sum_{x \in C} |x - a_C|_K$,

where $a = (a_C)_{C \in \mathscr{C}}$ is the sequence of cluster centers.

Note that, by the ultrametric property of $|\cdot|_K$, cluster centers can (and will) always be chosen within X. This has already been taken care of in the definition of the task. Note further that, unlike in the Archimedean setting, cluster centers are in general not uniquely defined by their corresponding clusters.

The most significant difference to the Archimedean case is given by the fact that in the p-adic situation, it does not make sense to choose a cluster center a priori, as illustrated in Example 4.1. Therefore, the order is reversed: first find a good partition, and then find corresponding cluster centers.

Example 4.1. Let {a, b, c} be some data with corresponding dendrogram as in Figure 1. Then choosing a, b as centers leads to the clustering $\mathscr{C} = \{\{a\}, \{b, c\}\}$, whereas the choice b, c leads either to $\mathscr{C}' = \{\{b, c\}\}$, $\mathscr{C}'' = \{\{a, b, c\}\}$, or to $\mathscr{C}''' = \{\{b\}, \{c\}\}$. But $\mathscr{C}'$ and $\mathscr{C}'''$ are not clusterings of {a, b, c}, while $\mathscr{C}''$ is. And both $\mathscr{C}'$ and $\mathscr{C}''$ each consist of one cluster containing the two prescribed centers instead of two distinct clusters as should be the case classically.

Last but not least, we will not give a global solution to the task in the p-adic case, but find certain types of local minima of $E_p$ in a sense which will become clear in the following subsection.

4.1. Some definitions. An important tool in the classification of p-adic data $X \subset K$ is its dendrogram D(X). In contrast to the Archimedean situation, it is uniquely determined by the data (cf. [H[S]). We view D(X) as a rooted metric tree. This means that it has a root $v_0$, and all edges are oriented away from $v_0$ and are assigned a length which is either positive real or infinite. The root $v_0$ corresponds to the top cluster consisting of the whole data X. The vertices correspond to clusters containing at least two points from X. An edge e of D(X) connecting two vertices is always bounded. The individual points of X correspond uniquely to the ends of the tree D(X). We do not view the data X as part of the tree D(X), but as its boundary. Hence, any $x \in X$ sits at the one extreme of an unbounded edge. Our viewpoint is probably in contrast to most others on hierarchical classification, where data correspond to terminal vertices of dendrograms. However, we argue in our favour that the dendrogram should reflect hierarchic approximations of data by clusters (vertices in D(X)) or, more generally, by initial terms in some p-adic expansion for data (points in D(X)). We refer to [H[S] for a more detailed description of p-adic dendrograms.

Given some vertex v of D(X), let ch(v) denote the set of edges emanating from v (i.e. not towards $v_0$), and let #ch(v) be its cardinality. By abuse of notation, we will identify ch(v) with the set of vertices and ends attached to the edges in ch(v).

Now, an upper bound for the contribution to $E_p$ of a cluster $C_v$, represented by some vertex or end v, is

$\mu(v) := \mu(C_v) = \max\{|x - y|_K \mid x, y \in C_v\}$.

As a side remark, note that this is nothing but the Haar measure of K evaluated in the p-adic disk $D_v \subset K$ corresponding to v. In any case, if v is an end then $\mu(v) = 0$, otherwise $\mu(v) > 0$.

Given a set V of vertices or ends of D(X), we set

(3) $E(V) := \sum_{v \in V} (\#C_v - 1) \cdot \mu(v)$,

and also write $E(v_1, \dots, v_b)$ in the case that $V = \{v_1, \dots, v_b\}$. Applying this to ch(v) for a vertex v, we obtain:

(4) $E(\mathrm{ch}(v)) \le E(v)$.

The following remark shows that minimising E(V) does make sense for our task:

Remark 4.2. Given a clustering $\mathscr{C} = \{C_v \mid v \in V\}$, where V is the corresponding set of vertices, for any choice of $a_v \in C_v$ it holds true that

$E_p(X, \mathscr{C}, a) \le E(V) =: E(\mathscr{C})$,

where $a = (a_v)_{v \in V}$.
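The bound of Remark 4.2 is easy to test on small examples. The sketch below assumes integer data in $\mathbb{Q}_2$, assumes that the clusters passed in are verticial (so that the diameter of a cluster equals $\mu(v)$ of its vertex), and uses hypothetical helper names.

```python
from fractions import Fraction
from itertools import combinations

def norm(x, p):
    """p-adic norm of an integer x."""
    if x == 0:
        return Fraction(0)
    v = 0
    while x % p == 0:
        x //= p
        v += 1
    return Fraction(1, p ** v)

def diameter(C, p):
    return max((norm(x - y, p) for x, y in combinations(C, 2)), default=Fraction(0))

def energy_with_centers(clustering, centers, p):
    """E_p(X, C, a): sum of distances of all data to the center of their cluster."""
    return sum(norm(x - a, p) for C, a in zip(clustering, centers) for x in C)

def energy_bound(clustering, p):
    """E(V) as in (3): sum over clusters of (#C - 1) * mu(C)."""
    return sum((len(C) - 1) * diameter(C, p) for C in clustering)

fine = [[0, 4], [1, 3]]
coarse = [[0, 4, 1, 3]]
print(energy_with_centers(fine, [0, 1], 2), energy_bound(fine, 2))
print(energy_with_centers(coarse, [0], 2), energy_bound(coarse, 2))
```

In the first clustering the bound is attained (both values are 3/4); in the second it is not (9/4 against 3).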

Let $\mathfrak{X}_k(Y)$ be the set of all clusterings $\mathscr{C}$ of X with cardinality $\ell \le k$ whose restriction to Y is verticial. On the set

(5) $\mathfrak{X} = \bigcup_{k \in \mathbb{N}} \bigcup_{Y \subseteq X} \mathfrak{X}_k(Y)$

of all clusterings, we define a partial ordering $\le$ (called refinement) as follows: $\mathscr{C}' \le \mathscr{C}$, if all $C \in \mathscr{C}$ are of the form $C = \bigcup_{i \in I} C'_i$ with $C'_i \in \mathscr{C}'$ ($i \in I$).

Let $C^v$ be the smallest verticial cluster containing a given cluster C. Then we can define the functional

$E \colon \mathfrak{X} \to \mathbb{R}, \quad \mathscr{C} \mapsto \sum_{C \in \mathscr{C}} (\#C - 1) \cdot \mu(C^v)$,

and observe that this obviously generalises E(V) from (3):

Lemma 4.3. If $\mathscr{C} \in \mathfrak{X}$ is verticial, then

$E(\mathscr{C}) = E(V)$,

where V is the vertex set associated to $\mathscr{C}$.

Lemma 4.4. E is strictly monotonic:

$\mathscr{C}' \le \mathscr{C} \Rightarrow E(\mathscr{C}') \le E(\mathscr{C})$,

and if $\mathscr{C}' \le \mathscr{C}$ are not equal, then $E(\mathscr{C}') < E(\mathscr{C})$.

Proof. Assume $C = \bigcup_{i \in I} C'_i \in \mathscr{C}$ with $C'_i \in \mathscr{C}'$. Then

$\sum_{i \in I} (\#C'_i - 1) \cdot \mu({C'_i}^v) \le \sum_{i \in I} (\#C'_i - 1) \cdot \mu(C^v) \le (\#C - 1) \cdot \mu(C^v)$,

where the first inequality holds true, because all $C'_i$ are contained in C. The second inequality is strict, if I contains more than one element. That is the case for some C, if $\mathscr{C}' \ne \mathscr{C}$. □

We denote by $E_{k,Y}$ the restriction of E to $\mathfrak{X}_k(Y)$. The following is immediate:

Lemma 4.5. Let $\mathscr{C}$ and $\mathscr{C}'$ minimise $E_{k,Y}$ and $E_{k',Y}$, respectively. Then

$k \le k' \Rightarrow E(\mathscr{C}') \le E(\mathscr{C})$.

4.2. The verticial clustering algorithm. The general strategy which we follow is to refine a given clustering of X in the "direction" which yields the lowest value of $E_p$ after splitting a vertex. The term "direction" refers to the refinement ordering on $\mathfrak{X}$, and we follow the possible "gradients" from a given point $\mathscr{C} \in \mathfrak{X}$. Concretely, this means splitting a vertex with highest energy contribution. In Section 5 we will see that the terms in quotation marks here can be taken ad literam.

In this subsection, we deal with verticial clusterings only. We can now formulate:

Algorithm 4.6 (Verticial clustering). Input. p-adic data $X \subset K$ with $\#X \ge 2$, and upper bound $k \ge 1$ for number of clusters.

Step 0. Compute $b = \#\mathrm{ch}(v_0)$ and $E(v_0) = \mu(v_0)$.

Step 1. If $b > k$, then terminate. Otherwise, compute $E(\mathrm{ch}(v_0))$ which is not greater than $E(v_0)$ by (4). Further identify the set of vertices $V_1 := \mathrm{ch}(v_0) \cap \mathrm{Vert}(D(X))$.

Step N. Assume that from the previous step, we are given some family $\mathscr{V}_{N-1} = \{V^{(i)}_{N-1}\}$ of sets consisting of $b^{(i)}_{N-1} \le k$ vertices, respectively. If for all i and all $v \in V^{(i)}_{N-1}$ it holds true that $b^{(i)}_N := b^{(i)}_{N-1} + \#\mathrm{ch}(v) > k$, then terminate.

Otherwise, find all i and all $v \in V^{(i)}_{N-1}$ such that $E(W^{(i)}_v)$ is smallest possible, where $W^{(i)}_v := \mathrm{ch}(v) \cup V^{(i)}_{N-1} \setminus \{v\}$ satisfies $\#W^{(i)}_v \le k$. Again, by (4), it holds true that

$E(W^{(i)}_v) \le E(V^{(i)}_{N-1})$.

Extract this new family $\mathscr{V}_N$ of vertex sets together with the lower energy value $E_N = E(W)$ for $W \in \mathscr{V}_N$.

Output. A family of clusterings $\{\mathscr{C}_i \mid i \in I\}$ (corresponding to the vertex sets in the last step) for which $E = E(\mathscr{C}_i)$ is locally minimal, together with the value of E.
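The refinement strategy can be illustrated by a simplified greedy variant. It is a sketch, not a transcription of Algorithm 4.6: it assumes distinct integer data in $\mathbb{Q}_p$, keeps only one clustering instead of a family (ties are resolved in favour of the first candidate), counts singletons as clusters, and uses the energy (3) with cluster diameters. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def norm(x, p):
    """p-adic norm of an integer x."""
    if x == 0:
        return Fraction(0)
    v = 0
    while x % p == 0:
        x //= p
        v += 1
    return Fraction(1, p ** v)

def diameter(C, p):
    return max((norm(x - y, p) for x, y in combinations(C, 2)), default=Fraction(0))

def children(C, p):
    """Maximal proper subclusters of a disk C: classes of the relation |x - y| < mu(C)."""
    mu, parts = diameter(C, p), []
    for x in C:
        for part in parts:
            if norm(x - part[0], p) < mu:   # ultrametricity: one representative suffices
                part.append(x)
                break
        else:
            parts.append([x])
    return parts

def energy(clustering, p):
    return sum((len(C) - 1) * diameter(C, p) for C in clustering)

def verticial_clustering(X, k, p):
    """Repeatedly split the cluster whose splitting gives the lowest energy, keeping <= k clusters."""
    clustering = [list(X)]
    while True:
        best = None
        for i, C in enumerate(clustering):
            if len(C) < 2:
                continue
            cand = clustering[:i] + children(C, p) + clustering[i + 1:]
            if len(cand) <= k and (best is None or energy(cand, p) < energy(best, p)):
                best = cand
        if best is None:
            return clustering
        clustering = best

print(verticial_clustering([0, 4, 1, 3, 2], 3, 2))
```

For the data {0, 4, 1, 3, 2} with p = 2 and k = 3, the sketch first splits the top cluster into {0, 4, 2} and {1, 3}, and then splits {0, 4, 2}, because that yields energy 3/4 rather than 1.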
4.3. p-adic cluster centers. The next objective is to find cluster centers with respect to the energy functional. Assume that we are given a fixed cluster $C = \{\alpha_1, \dots, \alpha_n\} \subset K$. We wish to find some $a \in C$ which minimises

$e(a) := E_p(C, \mathscr{C}, a) = \sum_{\alpha \in C} |\alpha - a|_K$,

where $\mathscr{C} = \{C\}$.

A branch B of a rooted tree (T, v) is a maximal subtree of $T \setminus \{v\}$. It has a root $v_B$ among the vertices of ch(v). Let $\mathfrak{B}(T)$ denote the set of branches of (T, v). In the case of our dendrogram D(C), we will write $\mathfrak{B}(C)$ instead of $\mathfrak{B}(D(C))$. The branches induce a natural partition of C:

$C = \bigcup_{B \in \mathfrak{B}(C)} C_B$

into a disjoint union of $C_B = \mathrm{Ends}(B)$.

Lemma 4.7. Let $a \in C$, and $B_a \in \mathfrak{B}(C)$ the branch containing a as an end, and $C_a = C_{B_a}$. Then

(6) $e(a) = \#(C \setminus C_a) \cdot \mu(v_0) + E_p(C_a, \mathscr{C}_a, a)$,

where $\mathscr{C}_a = \{C_a\}$.

Proof. Together with the identity

$\sum_{\alpha \in C_a} |\alpha - a|_K = E_p(C_a, \mathscr{C}_a, a)$,

this follows easily by looking at the tree D(C). □

Lemma 4.8. Assume the notations as in Lemma 4.7. It holds true that

(7) $\frac{e(a)}{\mu(v_0)} = N_a + O(p^{\nu_a})$

with $N_a \in \mathbb{N}$ and $\nu_a < 0$.

Equation (7) means that $e(a)/\mu(v_0)$ is a natural number plus some small term given as a multiple of $p^{\nu_a}$.

Proof. Set $N_a = \#(C \setminus C_a)$, and notice that

(8) $E_p(C_a, \mathscr{C}_a, a) \le \#C_a \cdot \mu(v_a)$,

where $v_a$ is the root of $B_a$. The claim now follows from the obvious inequality $\mu(v_a) < \mu(v_0)$. □

Now, we can formulate our algorithm:

Algorithm 4.9 (Cluster centers). Step 1. Find all branches $B^{(1)} \in \mathfrak{B}(C)$ with largest value of $\#C_{B^{(1)}}$. Extract those clusters $C_{B^{(1)}}$ for which $\mu(v_{B^{(1)}})$ is minimal, and the number

$c_1 = \max\{\#C_{B^{(1)}} \mid B^{(1)} \in \mathfrak{B}(C)\}$.

Step N. Assume that in the previous step, a list of clusters $C_{B^{(N-1)}}$, and a number $c_{N-1}$ is produced. Find all branches $B^{(N)}$ of the rooted trees $D(C_{B^{(N-1)}})$ with largest possible value $c_N$ of $\#C_{B^{(N)}}$. Extract those clusters $C_{B^{(N)}}$ minimising $\mu(v_{B^{(N)}})$, together with $c_N$.

At some point, there will be a Step N' in which the trees $D(C_{B^{(N')}})$ have only one vertex each. The procedure terminates thus:

Output. A list $(C_i)_{i \in I}$ of those clusters from Step N' with minimal value of $\mu(v_i)$, where $v_i$ is the vertex of $D(C_i)$.

Theorem 4.10. Let $C' = C_{N'} \subseteq C$ be a cluster produced by performing Algorithm 4.9. Then any $a \in C'$ is a center of C with respect to $E_p$.

Proof. Let $C = C_0 \supset C_1 \supset \dots \supset C_{N'} = C'$ be a strictly decreasing chain of clusters produced by the N' steps of Algorithm 4.9. Let the corresponding cardinalities be $c_0, \dots, c_{N'}$. By applying Lemma 4.7, it holds true that

(9) $e(a) = c_{N'} \cdot \mu(v_{N'}) + \sum_{i=1}^{N'} (c_{i-1} - c_i) \cdot \mu(v_{i-1})$,

where $v_j$ is the root of the corresponding branch from Step j. The minimality of e(a) is guaranteed by (7), applied to each step. Notice that we have used the obvious fact that for C', the inequality (8) is an equality. □
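For small clusters, centers can also be found by brute force, which gives a way to cross-check the outcome of the branch-following procedure. The sketch below is such a brute-force check and not an implementation of Algorithm 4.9; it assumes integer data in $\mathbb{Q}_p$, and the names are hypothetical.

```python
from fractions import Fraction

def norm(x, p):
    """p-adic norm of an integer x."""
    if x == 0:
        return Fraction(0)
    v = 0
    while x % p == 0:
        x //= p
        v += 1
    return Fraction(1, p ** v)

def cluster_energy(C, a, p):
    """e(a): sum of the distances of all elements of C to the candidate center a."""
    return sum(norm(x - a, p) for x in C)

def centers(C, p):
    """All minimisers of e(a) over a in C; in general there is more than one."""
    best = min(cluster_energy(C, a, p) for a in C)
    return [a for a in C if cluster_energy(C, a, p) == best]

C = [0, 9, 1, 2]
print(centers(C, 3), cluster_energy(C, 0, 3))
```

With p = 3 the branches of D(C) are {0, 9}, {1} and {2}. Both elements of the largest branch are centers, and e(0) = 2·1 + 1/9 agrees with formula (6).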

4.4. Quasi-verticial clustering. The two previous subsections already lead to a p-adic algorithm for verticial clusterings and their centers. In this case, subdividing a cluster $C_v$ means to make as many subclusters as there are elements in ch(v). In the case that e.g. there are many singletons, this can be a disadvantage. Hence removing singletons provides more flexibility in that the bigger subclusters can either be merged or kept distinct. Even greater flexibility can be achieved if almost indistinguishable clusters are treated as singletons.

Definition 4.11. Fix some real $\varepsilon > 0$. A verticial cluster $C_v \subseteq X$ with corresponding vertex v is called a quasi-singleton for $\varepsilon$, if $E(v) < \varepsilon$.

When we speak of a quasi-singleton, we mean a quasi-singleton for some $\varepsilon$ known from the context.

Example 4.12. The dendrogram in Figure 3 contains a quasi-singleton {a, b}, if we set $\mu(v) = p^{-\ell}$ for vertex v at level $\ell$ (indicated by the number at the left), and $p^{-4} < \varepsilon < p^{-1}$. For this choice of $\varepsilon$, the cluster {c, d} is not a quasi-singleton. But this is the case for larger $\varepsilon$.

Clearly, any singleton is a quasi-singleton for any $\varepsilon$. Since we are working with a fixed p-adic field K, it is possible to choose $\varepsilon$ so small that the quasi-singletons are precisely the singletons of our given dataset X.

The algorithm we propose in the following removes quasi-singletons in order to continue with verticial clusterings. For this, we fix some notation: When referring to a subset Y of our dataset X, we will indicate this by the subscript Y. E.g. $\mathrm{ch}_Y(v)$ means the set of edges in D(Y) going out from v. Similarly, with $\mu_Y(v)$, $E_Y(V)$ etc.

Figure 3. Dendrogram with quasi-singleton {a, b} for $p^{-4} < \varepsilon < p^{-1}$ (levels 0 to 4, data a, b, c, d).

Algorithm 4.13 (Quasi-verticial clustering). Input. Data $X_0 := X \subset K$, and numbers $k_0 := k \ge 1$, $\varepsilon > 0$.

Step 1. Remove from D(X) all $v \in \mathrm{ch}_{X_0}(v_0)$ corresponding to quasi-singletons for $\varepsilon$. Let $s_1$ be the number of vertices removed. Extract corresponding reduced dataset $X_1 \subseteq X_0$, as well as $\mathrm{ch}_{X_1}(v_0)$, $E_{X_1}(v_0) = \mu_{X_1}(v_0)$, and $k_1 := k - s_1$.

Step N. Assume that in the previous step, we are given a quadruple of families

$(\mathscr{V}_{N-1}, \mathscr{X}_{N-1}, E_{N-1}, \mathscr{K}_{N-1})$

of sets $V \in \mathscr{V}_{N-1}$ of vertices in D(X), datasets $X(V) \in \mathscr{X}_{N-1}$, an energy value $E_{N-1} = E_{X(V)}(V)$, and numbers $k_{N-1}(V) \le k$ (where $V \in \mathscr{V}_{N-1}$). Remove for all $V \in \mathscr{V}_{N-1}$ from D(X(V)) all vertices in $\mathrm{ch}_{X(V)}(v)$ corresponding to $s_N(v)$ quasi-singletons, where $v \in V$. Find all $V \in \mathscr{V}_{N-1}$ and $v \in V$ such that

(1) $k_{N-1}(V) - s_N(v) > 0$, and

(2) $E_{X(V)}(W_v) \le E_{N-1}$ is smallest possible,

where $W_v := \mathrm{ch}(v) \cup V \setminus \{v\}$. Extract corresponding quadruple of families

$(\mathscr{V}_N, \mathscr{X}_N, E_N, \mathscr{K}_N)$

of new vertex sets $W_v$, reduced datasets $X(W_v) \subseteq X(V)$, energy value $E_N = E(W_v)$, and $k_N(W_v) := k_{N-1}(V) - s_N(v)$.

Output. A list of clusterings consisting of quasi-singletons for $\varepsilon$ and clusters produced above by collecting the remnants in each step.

Remark 4.14. The output clusterings of Algorithm 4.13 all have energy of the form

$E + O(p^{\alpha})$,

where E is independent of the clustering, and $\alpha < 0$ is small.

We can now put things together in order to find clusterings in different ways:

Algorithm 4.15 ((Quasi-)Verticial split-LBG$_p$). Input. As in Algorithm 4.6 (resp. Algorithm 4.13).

Step 1. Perform Algorithm 4.6 (resp. Algorithm 4.13).

Step 2. Perform Algorithm 4.9 for each cluster occurring in each clustering given out in the previous step.

Output. A list $(\mathscr{C}_i, a_i)_{i \in I}$ of E-suboptimal clusterings $\mathscr{C}_i$ with corresponding list of E-center vectors $a_i$ for clustering $\mathscr{C}_i$.

Both subroutines, Algorithms 4.6 and 4.9, boil down to counts and evaluations of $\mu(v)$ for vertices v.

5. Dependence on the choice of the prime p 

A natural issue is, how the outputs of the algorithms introduced in the previous 
sections depend on the choice of the prime number p. We will prove a finiteness 
result. 

Recall that the energy of a verticial cluster $C_v$ is of the form

(10) $E(C_v) = A \cdot p^{-\nu}$

with natural numbers A and $\nu$, and is additive on disjoint unions of clusters. Splitting a cluster is performed by replacing vertex v by the vertex set ch(v), and the change in energy is given by

$E_{\mathrm{new}} = E_{\mathrm{old}} - E(C_v) + E(\mathrm{ch}(v))$,

i.e. the difference is

$\delta_v E_p := E(C_v) - E(\mathrm{ch}(v))$.

Our approach towards minimising $E_p$ is to refine the given clustering in the direction of largest $\delta_v E_p$. Now, the quantity $\delta_v E_p$ depends on the prime number p as shown by (10). This means that different p can result in different rankings of the vertices by the order in which they are split. We call this the p-ranking of the vertices of D(X).
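How the ranking can change with p is visible already on a very small tree. The sketch below uses a hypothetical dendrogram (not the one of Figure 4): a root r at level 0 with two child vertices, u at level 1 with two data below it and w at level 2 with five data below it. It assumes $\mu(v) = p^{-\ell(v)}$ for a vertex at level $\ell(v)$, as in Example 4.12; the names `TREE`, `energy`, `gradient` and `p_ranking` are illustrative.

```python
from fractions import Fraction

# Hypothetical dendrogram: vertex -> (level, child vertices, number of data below the vertex)
TREE = {"r": (0, ["u", "w"], 7), "u": (1, [], 2), "w": (2, [], 5)}

def energy(v, p):
    """E(v) = (#C_v - 1) * mu(v), with mu(v) = p^(-level) (an assumption of this sketch)."""
    level, _, size = TREE[v]
    return (size - 1) * Fraction(1, p ** level)

def gradient(v, p):
    """delta_v E_p = E(C_v) - E(ch(v)); ends below v contribute no energy."""
    return energy(v, p) - sum(energy(c, p) for c in TREE[v][1])

def p_ranking(p):
    """Vertices ordered by decreasing energy gradient."""
    return sorted(TREE, key=lambda v: gradient(v, p), reverse=True)

for p in (2, 3, 5, 7):
    print(p, p_ranking(p))
```

For p = 2 and p = 3 the populous low-level vertex w ranks before u (gradients 4/p² against 1/p), whereas from p = 5 onwards u ranks before w and the ranking no longer changes.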

Example 5.1. Assume we want to find verticial clusterings of data

$X = \{x_1, \dots, x_{13}\}$

having underlying dendrogram as in Figure 4. Consider the vertices a, b, c, d in the underlying rooted vertex tree as depicted in Figure 5. Then Table 1 shows the different p-rankings of these vertices for p = 2, 3 and 5.

Figure 4. A dendrogram (levels 0 to 2, data $x_1, \dots, x_{13}$).

Figure 5. Vertex tree underlying Figure 4.

Table 1. Vertex rankings for Figure 4.

Theorem 5.2. For all but finitely many primes, the p-rankings of the vertices of a given dendrogram D(X) belonging to data X taken from a fixed p-adic field are the same.

Proof. The energy gradient for a vertex v can be written as

$\delta_v E_p = P_v(t)\big|_{t = \frac{1}{p}}$

for some polynomial $P_v(t)$ whose coefficients are natural numbers. By dividing off powers of t, we may assume that $P_v(t)$ has a non-zero constant term, hence that $P_v(0) > 0$. By the considerations from the previous sections, we know that

(11) $0 < P_v\left(\tfrac{1}{p}\right) < P_v(0)$

for all primes p. By viewing $P_v(t)$ as a continuous function on the interval [0, 1/2], we see from the right inequality in (11) that $P_v(t)$ must be decreasing on some interval [0, x] with positive $x < \tfrac{1}{2}$ sufficiently small. It follows that the sequence of values $P_v(\tfrac{1}{p})$ for prime $p \to \infty$ converges to $P_v(0)$. Since that limit equals E(v) on the maximal subtree of D(X) having v as its root, we have proven

$\lim_{p \to \infty} P_v\left(\tfrac{1}{p}\right) = E(v)$.

In other words, for sufficiently large prime p, the vertex gradient can be approximated by the vertex energy. Hence the ranking of the vertices is approximately the ranking of the numbers

(12) $(\#C_v - 1) \cdot p^{-\ell(v)}$,

where $\ell(v)$ depends on the level of v in the dendrogram. The latter ranking does not change once p is sufficiently large. Hence, for large p the vertex ranking does not change. □

Remark 5.3. Notice from (12) that using a large prime number tends to force splitting vertices higher up in the hierarchy underlying the dendrogram. On the other hand, taking a small prime number also allows splitting clusters containing lots of data at low levels in the hierarchy.

Theorem 5.4. Let $C \subset K$ be a cluster. If a is a center of C with respect to $E_p$ for some prime p, then it is a center for all primes.

Proof. From Lemma 4.7 it follows that

$e_p(a) = E_p(C, \mathscr{C}, a) = \sum_{v \in V} \alpha_v \, \mu(v)$,

where V is the set of vertices on the path $\gamma$ from the top $v_0$ down to a. As $\mu(v) = p^{-\ell(v)}$, and the $\ell(v)$ form a strictly increasing sequence $\ell_0, \dots, \ell_M$ of natural numbers as v proceeds along $\gamma$, it follows that $e_p(a)$ is given by evaluating the polynomial

$F_\gamma(t) = \sum_{i=0}^{M} \alpha_i \, t^{\ell_i}$

in $t = \tfrac{1}{p}$, where $\alpha_i > 0$ equals that number $\alpha_v$ with v such that $\ell(v) = \ell_i$. Now, $e_p(a)$ being a minimum means that in the collection

$\{F_\gamma(t) \mid \gamma \text{ path } v_0 \rightsquigarrow X\}$

the term $\alpha_0 t^{\ell_0}$ is of lowest degree and that coefficient $\alpha_0$ is smallest among those terms of lowest degree. And this does not depend on the choice of prime p. □

6. p-adic learning

In this section we discuss a learning situation in which some p-adic data $X \subset K$ together with a clustering $\mathscr{C}_X$ is used as a "training set". The idea is to classify new data Y taken from some p-adic field $L \supseteq K$ on the basis of X and $\mathscr{C}_X$. Without loss of generality we assume that the two p-adic fields L and K coincide.

6.1. p-adic classifiers. Learning can be performed by using a classifier which integrates new data $y \in Y$ into an existing dendrogram D(X) in order to find a suitable cluster for y. We will define such a classifier in the p-adic situation.

As it may happen that adjoining a point $y \in Y$ to X increases the size of the smallest p-adic disk containing the training data X, we use the point at infinity already introduced in [4]. This allows us to classify those data in Y which cannot be classified on the basis of $(X, \mathscr{C}_X)$ as belonging to the "cluster at infinity". Our method will use the extended dendrogram

$D_\infty(X) = D(P(X))$,

where $P(X) = X \cup \{\infty\}$.^2 The datum $\infty$ will be depicted at the end of a path going upwards from $v_0$, whereas all other data will be at the end of paths leading downwards.

Example 6.1. In Figure 6 some datum y is adjoined to a training dataset X = {a, b, c}. As it happens that the distance of y to X is larger than the diameter of X, the path $v_0 \rightsquigarrow y$ in the dendrogram $D_\infty(X \cup \{y\})$ has a portion going upwards in direction $\infty$.

^2 Note that $D_\infty(X)$ is what is denoted by D(X) in g][5].

Figure 6. Dendrogram with cluster at infinity.

We call the pair $\mathcal{X} := (X, \mathscr{C}_X)$ a classification and have a classification map

$\kappa_{\mathcal{X}} \colon P(X) \to \mathscr{C}_X^\infty, \quad x \mapsto C_x$,

which assigns to each $x \in P(X)$ the cluster $C_x$ containing x, with

$\mathscr{C}_X^\infty = \mathscr{C}_X \cup \{\{\infty\}\}$.

Now, let $Z = X \cup Y$. We have the inclusion map $\iota \colon P(X) \to P(Z)$ which takes $x \in X$ to itself and $\infty$ to $\infty$.

Definition 6.2. A p-adic classifier for Y modeled on $(X, \mathscr{C}_X)$ is a map

$\Lambda \colon P(Z) \to \mathscr{C}'$,

where $\mathscr{C}'$ is a clustering of P(Z), such that there exists an injective map $\phi \colon \mathscr{C}_X^\infty \to \mathscr{C}'$ making the diagram

$\begin{array}{ccc} P(X) & \xrightarrow{\ \iota\ } & P(Z) \\ \downarrow{\scriptstyle \kappa_{\mathcal{X}}} & & \downarrow{\scriptstyle \Lambda} \\ \mathscr{C}_X^\infty & \xrightarrow{\ \phi\ } & \mathscr{C}' \end{array}$

commutative. The cluster $C_\infty := \Lambda^{-1}(\phi(\{\infty\}))$ is called the residue of $\mathcal{X}$. A classifier is called saturated, if $\phi$ is bijective.

Remark 6.3. Notice that $\phi$ is unique if it exists.

Our first learning algorithm constructs the classifier sequentially by computing the distance to cluster centers for $\mathscr{C}_X$. Let $A = \{a_C \mid C \in \mathscr{C}_X\}$ be the set of given cluster centers $a_C \in C$. Then we have for $y \in Y$ the map

$d_y \colon \mathscr{C}_X \to \mathbb{R}, \quad C \mapsto |y - a_C|_K$,

and let $m_y := \min d_y(\mathscr{C}_X)$.

The vertex $v_y \in D_\infty(A \cup \{y\})$ nearest to y can be found e.g. using the p-adic expansions as given by (1). Namely, a vertex corresponds to a disk containing two or more p-adic numbers in $A \cup \{y\}$ having common initial terms determined by the radius of the disk. In geometric terms, traversing along the geodesic path $\gamma_y \colon \infty \rightsquigarrow y$ until all $a \in A$ have branched off $\gamma_y$ yields the vertex $v_y$, and $\mu(v_y)$ is determined by the subset $C_{v_y} \subseteq A$ of those elements branching off precisely in $v_y$. The length of the path $v_0 \rightsquigarrow v_y$ gives $m_y$. And the map $d_y$ is computed:

Lemma 6.4. It holds true that

$\mathscr{C}_y := d_y^{-1}(m_y) = \{C_a \in \mathscr{C}_X \mid a \in C_{v_y}\}$.

Proof. By what has been said above, the minimum is attained precisely for those clusters $C \in \mathscr{C}_X$ whose center is contained in $C_{v_y}$. Hence $C = C_a$ for some $a \in C_{v_y}$. □
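The set of nearest clusters described in Lemma 6.4 can be computed directly from the distances to the centers. The following sketch assumes $K = \mathbb{Q}_p$ with rational data and a dictionary from each center to its cluster; it only evaluates $d_y$ and $m_y$ and does not build the dendrogram. The names `padic_norm` and `nearest_clusters` are hypothetical.

```python
from fractions import Fraction

def padic_norm(x, p):
    """|x|_p for a rational x (assumes K = Q_p)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    num, den, v = x.numerator, x.denominator, 0
    while num % p == 0:
        num //= p
        v += 1
    while den % p == 0:
        den //= p
        v -= 1
    return Fraction(p) ** (-v)

def nearest_clusters(y, centers, p):
    """Return m_y and the clusters whose center attains it.

    `centers` maps each cluster center a_C to its cluster C.
    """
    d = {a: padic_norm(y - a, p) for a in centers}
    m_y = min(d.values())
    return m_y, [centers[a] for a in centers if d[a] == m_y]

centers = {0: [0, 4], 1: [1, 3]}
print(nearest_clusters(8, centers, 2))               # y close to the center 0
print(nearest_clusters(Fraction(1, 2), centers, 2))  # y outside the smallest disk containing the data
```

In the second call both centers are at distance 2 from y, which exceeds the diameter 1 of the training data; this is the kind of datum that the cluster at infinity is meant to absorb.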

The task is now to decide into which cluster from $\mathscr{C}_y$ to put y.

Algorithm 6.5. Input. A classification $\mathcal{X}_0 := \mathcal{X} = (X, \mathscr{C}_X)$, a set $A = \{a_C \mid C \in \mathscr{C}_X\}$ of cluster elements $a_C \in C$, and a set $Y \subset K$ of cardinality M.

Step 0. Set $C_\infty := \{\infty\}$.

Step 1. Take $y := y_1 \in Y$, and compute $C_{v_y}$, $\mathscr{C}_y$, and $m_y$.

Case 1. If $C_{v_y} = \{a\}$, then set $C_y := C_a \cup \{y\}$ and $A_1 := A$.

Case 2. If $\#C_{v_y} > 1$, then find the subset $\tilde{C}_y \subseteq C_{v_y}$ of all elements whose nearest vertex in $D_\infty(C_{v_y} \cup \{y\})$ equals $v_y$. If $\tilde{C}_y = \emptyset$, then set $C_y = \{y\}$ and $A_1 := A \cup \{y\}$. Otherwise, find all elements $a \in \tilde{C}_y$ with minimal energy $E(C_a \cup \{y\})$. If there is more than one such a, then $C_y := \{y\}$ and $A_1 := A \cup \{y\}$. Otherwise, $C_y := C_a \cup \{y\}$, and $A_1 := A$.

In any case, produce $Y_1 := Y \setminus \{y\}$, $A_1$ and classification $\mathcal{X}_1 := (X_1, \mathscr{C}_{X_1})$, where $X_1 = X \cup \{y\}$ and $\mathscr{C}_{X_1} := \{C_y\} \cup \mathscr{C}_X \setminus \{C_a\}$. Terminate, if $Y_1 = \emptyset$.

Step N. Assume that in the previous step, sets $Y_{N-1}$, $A_{N-1}$ and a classification $\mathcal{X}_{N-1}$ have been given out. Then perform Step 1 with $\mathcal{X} := \mathcal{X}_{N-1}$, $A := A_{N-1}$, and $Y := Y_{N-1}$.

Output. On termination in Step M, an optimal classifier

$\Lambda \colon P(X_M) \to \mathscr{C}_{X_M}^\infty, \quad x \mapsto C_x$,

modeled on $\mathcal{X}_0$.

Proof of optimality. In each step N, $y_N \in Y_N$ is assigned to the cluster $C \in \mathscr{C}_{X_{N-1}}$ with minimal energy $E(C \cup \{y_N\})$. □

Theorem 6.6. The outcome of Algorithm 6.5 does not depend on the choice of the set A of cluster representatives.

Proof. The outcome of Step 1 does not depend on A. □

Remark 6.7. A consequence of Theorem 6.6 is that Algorithm 6.5 does indeed effect learning in the sense that to any $y \in Y$ is assigned a cluster depending on the already existing clusters. Representing a cluster by a single element makes learning efficient.

6.2. Adaptive learning. During the learning process^3, it can become useful to subdivide big clusters of the extended dataset $X \cup Y$. This is not a problem, as the old cluster centers can be reused in the new clustering.

^3 Or if for some reason one wants to perform a variation of split-LBG$_p$ in which centers are computed after each clustering step, instead of after termination of clustering.

Lemma 6.8. Let C be a cluster, and $a \in C$ a center of C. Assume that C' is a subcluster of C containing a, then a is a center of C'.

Proof. Clearly, it holds true that

(13) $E_p(C, \mathscr{C}, a) \le E_p(C, \mathscr{C}, a')$,

where $\mathscr{C} = \{C\}$ and $\mathscr{C}' = \{C'\}$. Assume that $a' \in C'$ is a center of C'. Now, inequality (13) implies that

$E_p(C', \mathscr{C}', a) + \sum_{x \in C \setminus C'} |x - a|_K = E_p(C, \mathscr{C}, a) \le E_p(C, \mathscr{C}, a') = E_p(C', \mathscr{C}', a') + \sum_{x \in C \setminus C'} |x - a'|_K$.

Since, by the cluster property of C', it holds true that

$|x - a|_K = |x - a'|_K$

for all $x \in C \setminus C'$, it follows that

(14) $E_p(C', \mathscr{C}', a) \le E_p(C', \mathscr{C}', a')$,

and, because a' is a center of C', this yields an equality in (14), i.e. a is a center of C'. □

Remark 6.9. Notice that Lemma 6.8 does not hold true, if we allow C' to be an arbitrary subset of C. E.g. assume in Figure 7 that C = {a, b, c, d}. Then a is a center of C, as can be verified from the left dendrogram. However, a is not a center of C' = {a, c, d}, as the right dendrogram reveals. Namely, in the first case, we compute with $\mathscr{C} = \{C\}$ and $\mathscr{C}' = \{C'\}$:

$E_p(C, \mathscr{C}, a) = E_p(C, \mathscr{C}, b) = |a - b|_K + 2 \cdot |a - c|_K < |c - d|_K + 2 \cdot |a - c|_K = E_p(C, \mathscr{C}, c) = E_p(C, \mathscr{C}, d)$,

and in the second case:

$E_p(C', \mathscr{C}', c) = |a - c|_K + |d - c|_K < 2 \cdot |a - c|_K = E_p(C', \mathscr{C}', a)$.

Figure 7. Dendrogram and subdendrogram.
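The phenomenon of Remark 6.9 can be reproduced with concrete numbers. The sketch below assumes p = 2 and the hypothetical values (a, b, c, d) = (0, 4, 1, 3), which satisfy |a − b| = 1/4 < |c − d| = 1/2 < |a − c| = 1; centers are found by brute force and the function names are illustrative.

```python
from fractions import Fraction

def norm(x, p):
    """p-adic norm of an integer x."""
    if x == 0:
        return Fraction(0)
    v = 0
    while x % p == 0:
        x //= p
        v += 1
    return Fraction(1, p ** v)

def cluster_energy(C, a, p):
    return sum(norm(x - a, p) for x in C)

def centers(C, p):
    """Brute-force minimisers of the cluster energy."""
    best = min(cluster_energy(C, a, p) for a in C)
    return [a for a in C if cluster_energy(C, a, p) == best]

a, b, c, d = 0, 4, 1, 3
print(centers([a, b, c, d], 2))   # a and b are centers of C
print(centers([a, c, d], 2))      # a is no longer a center of the subset {a, c, d}
print(centers([a, b], 2))         # on the subcluster {a, b}, a remains a center
```

The subset {a, c, d} is not a cluster of C, since b lies at distance 1/4 from a, which is less than the diameter 1 of the subset; this is why Lemma 6.8 does not apply to it.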

At last, we propose the splitting of high-energy clusters accumulated during the learning process:

Algorithm 6.10. Input. $r > 0$. Otherwise, as in Algorithm 6.5.

Perform Algorithm 6.5 with modification:

Step N'. Perform Step N. If for $y := y_N$ it holds true that $E(C_y) > r$, then split cluster $C_y$ into its maximal proper subclusters, and adjoin to $A_N$ new cluster centers using Algorithm 4.9.

7. Conclusion 

A straightforward translation of the split-LBG algorithm to the situation of 
classifying p-adic data does not exist. However, if clusterings, cluster centers and 
their numbers are allowed to vary, then the minimisation problem for the p-adic 
energy functional defined by distances to centers does make sense. Sub-optimal 
algorithmic solutions to the minimisation problem are presented, in which the choice 
lies in whether or not to remove in each step quasi-singletons, i.e. clusters which are 
almost singletons because of their energy values being lower than a given threshold. 
The method is to find rankings of vertices in the dendrogram associated to the p- 
adic data. The outcome depends on the prime number p, but it is shown that 
for all but finitely many primes the rankings are identical. The consequence for 
applications to data analysis is that for fixed prime p, the classification results do
not depend on the p-adic representation of the data, as long as the dendrograms are
isomorphic. Furthermore, the minimising property for given cluster centers holds
true independently of the prime. This means that if some datum is a cluster center
for one prime, it is a cluster center for all primes (for which the corresponding
cluster is not larger). Using p-adic cluster centers, one can construct classifiers
from given clusterings. This can be applied to learning situations.